18 research outputs found
Contributions to region-based image and video analysis: feature aggregation, background subtraction and description constraining
Unpublished doctoral thesis, defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Tecnología Electrónica y de las Comunicaciones. Date of defense: 22-01-2016. Full-text access embargoed until 22-07-2017.
The use of regions for image and video analysis has been traditionally motivated by their ability
to diminish the number of processed units and hence, the number of required decisions. However,
as we explore in this thesis, this is just one of the potential advantages that regions may
provide. When dealing with regions, two description spaces may be differentiated: the decision
space, on which regions are shaped—region segmentation—, and the feature space, on which
regions are used for analysis—region-based applications—. These two spaces are highly related.
The solutions taken on the decision space severely affect their performance in the feature space.
Accordingly, in this thesis we propose contributions in both spaces. The contributions
to region segmentation are twofold. First, we revisit a classical region segmentation
technique, Mean-Shift, exploring new solutions to automatically set the spectral
kernel bandwidth. Second, we propose a method to describe the micro-texture of a pixel
neighbourhood using an easily customisable filter-bank methodology based on the
discrete cosine transform (DCT). The rest of the thesis is devoted to describing region-based
approaches to several highly topical issues in computer vision; two broad tasks are explored:
background subtraction (BS) and local descriptors (LD). Concerning BS, regions are used
as complementary cues to refine pixel-based BS algorithms: they provide
illumination-robust cues and store the background dynamics in a region-driven
background model. Concerning LD, the region is used to reshape the description area, usually fixed, of local descriptors.
Region-masked versions of classical two-dimensional and three-dimensional local descriptions are
designed. The resulting descriptions are proposed for the task of object identification, under a novel
neural-oriented strategy. Furthermore, a local description scheme based on a fuzzy use of the
region membership is derived. This characterisation scheme has been geometrically adapted to
account for projective deformations, providing a suitable tool for finding corresponding points
in wide-baseline scenarios. Experiments have been conducted for every contribution, discussing
the potential benefits and limitations of the proposed schemes. Overall, the obtained results
suggest that the region, conditioned on a successful aggregation process, is a reliable and
useful tool to extrapolate pixel-level results, diminish semantic noise, isolate significant object
cues and constrain local descriptions. The methods and approaches described throughout this thesis
present alternative or complementary solutions to pixel-based image processing.
This work was partially supported by the Spanish Government through
its FPU grant program and the projects TEC2007-65400 (SemanticVideo), TEC2011-25995 (EventVideo)
and TEC2014-53176-R (HAVideo); the European Commission (IST-FP6-027685, Mesh); the
Comunidad de Madrid (S-0505/TIC-0223, ProMultiDis-CM); and the Spanish Administration Agency
CENIT 2007-1007 (VISION).
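The thesis abstract above does not specify its spectral bandwidth-selection scheme; as a rough illustration of automatically setting a mean-shift kernel bandwidth, the sketch below uses a per-point k-nearest-neighbour distance heuristic (an assumption for illustration, not the thesis's actual method):

```python
import numpy as np

def knn_bandwidth(points, k=5):
    """Per-point bandwidth from the distance to the k-th nearest neighbour.
    A common heuristic; the thesis's actual selection scheme may differ."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, k]  # column 0 is the distance of each point to itself

def mean_shift(points, k=5, iters=50, tol=1e-4):
    """Gaussian-kernel mean-shift using the per-point bandwidths above."""
    h = knn_bandwidth(points, k)
    modes = points.astype(float).copy()
    for _ in range(iters):
        # squared distances from every current mode to every data point
        d2 = ((modes[:, None, :] - points[None, :, :]) ** 2).sum(-1)
        w = np.exp(-0.5 * d2 / (h[None, :] ** 2))  # sample-point estimator
        new = (w[:, :, None] * points[None, :, :]).sum(1) / w.sum(1, keepdims=True)
        if np.abs(new - modes).max() < tol:
            modes = new
            break
        modes = new
    return modes
```

Points drawn from well-separated clusters converge to one mode per cluster, which is the aggregation behaviour region segmentation relies on.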
Automatic semantic parsing of the ground-plane in scenarios recorded with multiple moving cameras
Nowadays, video surveillance scenarios usually rely
on manually annotated focus areas to constrain automatic video
analysis tasks. Whereas manual annotation simplifies several
stages of the analysis, its use hinders the scalability of the developed
solutions and might induce operational problems in scenarios
recorded with Multiple and Moving Cameras (MMC). To
tackle these problems, an automatic method for the cooperative
extraction of Areas of Interest (AoIs) is proposed. Each captured
frame is segmented into regions with semantic roles using a
state-of-the-art method. Semantic evidence from different time instants,
cameras and points of view is then spatio-temporally aligned
on a common ground plane. Experimental results on widely-used
datasets recorded with multiple but static cameras suggest that
this process provides broader and more accurate AoIs than those
manually defined in the datasets. Moreover, the proposed method
naturally determines the projection of obstacles and functional
objects in the scene, paving the road towards systems focused on
the automatic analysis of human behaviour. To our knowledge,
this is the first study addressing this problem, as evidenced
by the lack of publicly available MMC benchmarks. To also cope
with this issue, we provide a new MMC dataset with associated
semantic scene annotations.
This study has been partially supported by the Spanish Government through
its TEC2014-53176-R HAVideo project.
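A minimal sketch of the ground-plane alignment described above, assuming each camera's image-to-ground homography is already known (the paper derives the alignment automatically; `H`, `walkable_label` and the voting rule are illustrative assumptions):

```python
import numpy as np

def to_ground_plane(points_px, H):
    """Project pixel coordinates onto the common ground plane with a
    camera-specific 3x3 homography H (assumed known from calibration)."""
    pts = np.hstack([points_px, np.ones((len(points_px), 1))])  # homogeneous
    g = pts @ H.T
    return g[:, :2] / g[:, 2:3]  # normalise by the third coordinate

def accumulate_evidence(grid, cells, labels, walkable_label=1):
    """Vote one camera's semantic evidence into a ground-plane grid:
    walkable pixels add support for an Area of Interest, others subtract."""
    for (i, j), lab in zip(cells, labels):
        if 0 <= i < grid.shape[0] and 0 <= j < grid.shape[1]:
            grid[i, j] += 1 if lab == walkable_label else -1
    return grid
```

Repeating the voting step over frames, cameras and viewpoints yields the cooperative evidence map from which AoIs can be thresholded.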
Accurate segmentation and registration of skin lesion images to evaluate lesion change
Skin cancer is a major health problem. There are several techniques to help diagnose skin lesions from a captured image. Computer-aided diagnosis (CAD) systems operate on single images of skin lesions, extracting lesion features to further classify them and help the specialists. Accurate feature extraction, which in turn depends on precise lesion segmentation, is key for the performance of these systems. In this paper, we present a skin lesion segmentation algorithm based on a novel adaptation of superpixel techniques and achieve the best reported results for the ISIC 2017 challenge dataset. Additionally, CAD systems have paid little attention to a critical criterion in skin lesion diagnosis: the lesion's evolution. This requires operating on two or more images of the same lesion, captured at different times but with a comparable scale, orientation, and point of view; in other words, an image registration process should first be performed. We also propose in this work an image registration approach that outperforms top image registration techniques. Combined with the proposed lesion segmentation algorithm, this allows for the accurate extraction of features to assess the evolution of the lesion. We present a case study with the lesion-size feature, paving the way for the development of automatic systems to easily evaluate skin lesion evolution.
This work was supported in part by the Spanish Government (HAVideo, TEC2014-53176-R) and in part by the TEC department (Universidad Autónoma de Madrid).
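The lesion-size case study mentioned above reduces, once segmentation and registration are done, to comparing areas of two binary masks; a minimal sketch under that assumption (`mm_per_px` is an illustrative calibration value, not from the paper):

```python
import numpy as np

def lesion_area_change(mask_t0, mask_t1, mm_per_px=0.1):
    """Relative change of the lesion-size feature between two binary
    segmentation masks of the same lesion. The masks are assumed to be
    already registered to a comparable scale, as the paper requires."""
    a0 = mask_t0.sum() * mm_per_px ** 2  # area at time t0, in mm^2
    a1 = mask_t1.sum() * mm_per_px ** 2  # area at time t1, in mm^2
    return a1, (a1 - a0) / a0            # new area and relative growth
```

A positive relative growth indicates the lesion has expanded between captures, the kind of evolution cue the paper's case study tracks.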
Semantic-Aware Scene Recognition
Scene recognition is currently one of the most challenging research fields in
computer vision. This may be due to the ambiguity between classes: images of
several scene classes may share similar objects, which causes confusion among
them. The problem is aggravated when images of a particular scene class are
notably different. Convolutional Neural Networks (CNNs) have significantly
boosted performance in scene recognition, although it still lags behind that of
other recognition tasks (e.g., object or image recognition). In this paper, we
describe a novel approach for scene recognition based on an end-to-end
multi-modal CNN that combines image and context information by means of an
attention module. Context information, in the shape of semantic segmentation,
is used to gate features extracted from the RGB image by leveraging
information encoded in the semantic representation: the set of scene objects
and stuff, and their relative locations. This gating process reinforces the
learning of indicative scene content and enhances scene disambiguation by
refocusing the receptive fields of the CNN towards them. Experimental results
on four publicly available datasets show that the proposed approach outperforms
every other state-of-the-art method while significantly reducing the number of
network parameters. All the code and data used in this paper are available at
https://github.com/vpulab/Semantic-Aware-Scene-Recognition
Comment: Paper submitted for publication to the Elsevier Pattern Recognition journal.
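The gating step described above can be reduced to a single-layer sketch: an attention map computed from semantic features multiplies the RGB features elementwise. The real module is learned end to end inside the CNN; the projection `W` and the sigmoid gate below are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def semantic_gate(rgb_feat, sem_feat, W):
    """Gate RGB features with an attention map derived from
    semantic-segmentation features.
    rgb_feat, sem_feat: (H, W, C) feature maps; W: (C, C) projection."""
    attn = sigmoid(sem_feat @ W)  # per-location, per-channel gate in (0, 1)
    return rgb_feat * attn        # refocus RGB features on indicative content
```

Locations whose semantic features support a class pass their RGB features through nearly unchanged; unsupported locations are suppressed, which is the disambiguation effect the abstract describes.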
Semantic-aware scene recognition
Scene recognition is currently one of the most challenging research fields in computer vision. This may be due to the ambiguity between classes: images of several scene classes may share similar objects, which causes confusion among them. The problem is aggravated when images of a particular scene class are notably different. Convolutional Neural Networks (CNNs) have significantly boosted performance in scene recognition, although it still lags behind that of other recognition tasks (e.g., object or image recognition). In this paper, we describe a novel approach for scene recognition based on an end-to-end multi-modal CNN that combines image and context information by means of an attention module. Context information, in the shape of a semantic segmentation, is used to gate features extracted from the RGB image by leveraging information encoded in the semantic representation: the set of scene objects and stuff, and their relative locations. This gating process reinforces the learning of indicative scene content and enhances scene disambiguation by refocusing the receptive fields of the CNN towards them. Experimental results on three publicly available datasets show that the proposed approach outperforms every other state-of-the-art method while significantly reducing the number of network parameters. All the code and data used in this paper are available at https://github.com/vpulab/Semantic-Aware-Scene-Recognition
This study has been partially supported by the Spanish Government through its TEC2017-88169-R MobiNetVideo project.
Graph Convolutional Network for Multi-Target Multi-Camera Vehicle Tracking
This letter focuses on the task of Multi-Target Multi-Camera vehicle
tracking. We propose to associate single-camera trajectories into multi-camera
global trajectories by training a Graph Convolutional Network. Our approach
simultaneously processes all cameras, providing a global solution, and it is
also robust to large camera synchronization errors. Furthermore, we design a new
loss function to deal with class imbalance. Our proposal outperforms the
related work, showing better generalization and requiring no ad-hoc manual
annotations or thresholds, unlike the compared approaches.
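The letter's actual loss is not reproduced here; as an illustration of handling class imbalance in edge classification for trajectory association, the sketch below uses a class-weighted binary cross-entropy, a common baseline when positive (same-identity) edges are rare:

```python
import numpy as np

def weighted_bce(p, y, pos_weight):
    """Class-balanced binary cross-entropy over graph edges.
    p: predicted edge probabilities; y: 0/1 same-identity labels;
    pos_weight: up-weighting factor for the rare positive class.
    Illustrative only -- not the letter's actual loss design."""
    eps = 1e-9  # avoid log(0)
    pos = -pos_weight * y * np.log(p + eps)
    neg = -(1 - y) * np.log(1 - p + eps)
    return (pos + neg).mean()
```

Raising `pos_weight` makes missed same-identity edges costlier, counteracting the dominance of negative edges in the association graph.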
On exploring weakly supervised domain adaptation strategies for semantic segmentation using synthetic data
Pixel-wise image segmentation is key for many computer vision applications. The training of deep neural networks for this task has expensive pixel-level annotation requirements, thus motivating a growing interest in synthetic data as a source of unlimited data and annotations. In this paper, we focus on the generation and application of synthetic data as representative training corpora for semantic segmentation of urban scenes. First, we propose a synthetic data generation protocol, which identifies key features affecting performance and provides datasets with variable complexity. Second, we adapt two popular weakly supervised domain adaptation approaches (combined training, fine-tuning) to employ synthetic and real data. Moreover, we analyze several backbone models, real/synthetic datasets and their proportions when combined. Third, we propose a new curriculum learning strategy to employ several synthetic and real datasets. Our major findings suggest the high performance impact of the pace and order of synthetic and real data presentation, achieving state-of-the-art results for well-known models. The results obtained by training with the proposed dataset outperform popular alternatives, thus demonstrating the effectiveness of the proposed protocol. Our code and dataset are available at http://www-vpu.eps.uam.es/publications/WSDA_semantic/
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work is part of the preliminary tasks related to the SEGA-CV (TED2021-131643A-I00) and the HVD (PID2021-125051OB-I00) projects funded by the Ministerio de Ciencia e Innovación of the Spanish Government.
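The pace-and-order finding above can be sketched as a curriculum sampler whose real-data proportion grows with the training step; the linear pacing function below is an illustrative assumption, not the paper's exact schedule:

```python
import random

def curriculum_batch(synthetic, real, step, total_steps, rng=random):
    """Sample one training example, moving from synthetic to real data
    as training progresses: a minimal pace-based curriculum sketch."""
    p_real = min(1.0, step / total_steps)       # linear pace: synthetic -> real
    pool = real if rng.random() < p_real else synthetic
    return rng.choice(pool)
```

Early steps draw only synthetic examples (cheap, unlimited annotations); late steps draw only real ones, so the model finishes adapted to the target domain.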